An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering
Authors
Abstract
Assignment methods are at the heart of many algorithms for unsupervised learning and clustering, in particular the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, which is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
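The trade-off described in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a finite domain, a sampling distribution `q`, two candidate densities stored as the rows of `p`, log-loss as the distortion, and equal mixing weights for the soft assignments. All function names are hypothetical. Under those assumptions, the expected log-loss of a hard partition equals a weighted KL "similarity" term plus the entropy of `q` minus the entropy of the partition, and the code checks that identity numerically.

```python
import numpy as np

def hard_assignments(p):
    """Hard assignment: each symbol goes to the density that gives it the highest probability."""
    return np.argmax(p, axis=0)

def soft_assignments(p):
    """Soft assignment: posterior over densities per symbol (assumes equal mixing weights)."""
    return p / p.sum(axis=0, keepdims=True)

def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    nz = dist[dist > 0]
    return float(-(nz * np.log(nz)).sum())

def decompose_expected_loss(q, p, labels):
    """Return (expected log-loss, weighted KL term, H(q), H(partition)).

    Assumes q and p are strictly positive (illustrative restriction).
    q: shape (m,), p: shape (k, m), labels: shape (m,) with values in 0..k-1.
    """
    k, m = p.shape
    # Expected log-loss when each symbol is scored by its assigned density.
    expected_loss = float(-(q * np.log(p[labels, np.arange(m)])).sum())
    # Mass that the partition gives to each cluster.
    weights = np.array([q[labels == b].sum() for b in range(k)])
    similarity = 0.0
    for b in range(k):
        mask = labels == b
        if weights[b] > 0:
            q_b = q[mask] / weights[b]  # q conditioned on cluster b
            similarity += weights[b] * float((q_b * np.log(q_b / p[b, mask])).sum())
    return expected_loss, similarity, entropy(q), entropy(weights)

# Toy data (made up for illustration).
q = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.2, 0.3, 0.4]])

labels = hard_assignments(p)
loss, similarity, h_q, h_partition = decompose_expected_loss(q, p, labels)
print("hard labels:", labels)
print("soft posteriors:\n", soft_assignments(p))
print("identity holds:", np.isclose(loss, similarity + h_q - h_partition))
```

Because the entropy of `q` does not depend on the partition, lowering the expected loss in this toy setting means either making each cluster's data better matched to its density (a smaller KL term) or making the partition more balanced (a larger partition entropy).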
Similar resources
NGTSOM: A Novel Data Clustering Algorithm Based on Game Theoretic and Self-Organizing Map
Identifying clusters is an important aspect of data analysis. This paper proposes a novel data clustering algorithm to increase the clustering accuracy. A novel game theoretic self-organizing map (NGTSOM) and neural gas (NG) are used in combination with Competitive Hebbian Learning (CHL) to improve the quality of the map and provide a better vector quantization (VQ) for clustering data. Different ...
Combination of real options and game-theoretic approach in investment analysis
Investments in technology create a large amount of capital investments by major companies. Assessing such investment projects is identified as critical to the efficient assignment of resources. Viewing investment projects as real options, this paper expands a method for assessing technology investment decisions in the linkage existence of uncertainty and competition. It combines the game-theore...
Clustering with Bregman Divergences
A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clust...
Information Theoretic Clustering Using Minimum Spanning Trees
In this work we propose a new information-theoretic clustering algorithm that infers cluster memberships by direct optimization of a non-parametric mutual information estimate between data distribution and cluster assignment. Although the optimization objective has a solid theoretical foundation it is hard to optimize. We propose an approximate optimization formulation that leads to an efficien...
Model-based Clustering with Soft Balancing
Balanced clustering algorithms can be useful in a variety of applications and have recently attracted increasing research interest. Most recent work, however, addressed only hard balancing by constraining each cluster to have equal or a certain minimum number of data objects. This paper provides a soft balancing strategy built upon a soft mixture-of-models clustering framework. This strategy con...